library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
    last_plot
The following object is masked from ‘package:stats’:
    filter
The following object is masked from ‘package:graphics’:
    layout
taxi <- fread("train.csv")
head(taxi)
summary(taxi)
      id            vendor_id  pickup_datetime    dropoff_datetime   passenger_count
 Length:1458644     1:678342   Length:1458644     Length:1458644     1      :1033540
 Class :character   2:780302   Class :character   Class :character   2      : 210318
 Mode  :character              Mode  :character   Mode  :character   5      :  78088
                                                                     3      :  59896
                                                                     6      :  48333
                                                                     4      :  28404
                                                                     (Other):     65
 pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  store_and_fwd_flag
 Min.   :-121.93   Min.   :34.36    Min.   :-121.93    Min.   :32.18     Length:1458644
 1st Qu.: -73.99   1st Qu.:40.74    1st Qu.: -73.99    1st Qu.:40.74     Class :character
 Median : -73.98   Median :40.75    Median : -73.98    Median :40.75     Mode  :character
 Mean   : -73.97   Mean   :40.75    Mean   : -73.97    Mean   :40.75
 3rd Qu.: -73.97   3rd Qu.:40.77    3rd Qu.: -73.96    3rd Qu.:40.77
 Max.   : -61.34   Max.   :51.88    Max.   : -61.34    Max.   :43.92
 trip_duration
 Min.   :      1
 1st Qu.:    397
 Median :    662
 Mean   :    959
 3rd Qu.:   1075
 Max.   :3526282
sd(taxi.refined$trip_duration)
[1] 659.8239
ggplot(data = taxi, aes(x = vendor_id, y = trip_duration)) + geom_boxplot(outlier.colour = "red")
ggplotly()
We recommend that you use the dev version of ggplot2 with `ggplotly()`
Install it with: `devtools::install_github('hadley/ggplot2')`
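The boxplot shows a few outliers for vendor 1. Their durations are orders of magnitude longer than the rest of the trips: the summary above reports a maximum trip_duration of 3,526,282 seconds (about 41 days) against a median of 662 seconds. These were most probably caused by technical glitches, so it would be good to remove them from the analysis.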
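One possible way to drop such extreme durations is sketched below. It is only an illustration: the toy data frame, the name `toy_no_outliers`, and the choice of Tukey's 1.5 * IQR cutoff are all assumptions, not the filter actually used in this notebook.

```r
# Toy stand-in for the `taxi` table; the values are made up for illustration.
toy <- data.frame(
  vendor_id     = factor(c(1, 1, 1, 2, 2, 2, 1, 2)),
  trip_duration = c(397, 662, 1075, 450, 800, 950, 3526282, 700)
)

# Tukey's rule (an assumed cutoff): keep trips at or below Q3 + 1.5 * IQR.
q     <- quantile(toy$trip_duration, probs = c(0.25, 0.75))
upper <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Hypothetical name for the filtered table.
toy_no_outliers <- toy[toy$trip_duration <= upper, ]
nrow(toy_no_outliers)
```

A fixed cap (for example, a maximum plausible trip length) would work just as well; the IQR rule is used here only because it needs no domain-specific threshold.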
This plot appears to show the whole population rather than a sample: the standard error (SE) overlaps with the mean. With this many trips the SE of the mean is very small.
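The size of the SE can be checked with the usual formula SE = sd / sqrt(n). The sketch below plugs in the standard deviation reported earlier. The row count is an assumption: it is the size of the full table, and the refined table may have somewhat fewer rows.

```r
# Standard deviation of trip_duration as printed earlier in the notebook.
sd_duration <- 659.8239

# Assumed n: row count of the full table; the refined table may be smaller.
n_assumed <- 1458644

# Standard error of the mean.
se <- sd_duration / sqrt(n_assumed)
se
```

Under this assumption the SE comes out well below one second, which is negligible next to a mean duration in the hundreds of seconds, so an error bar drawn around the mean would be practically invisible.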